Gusztav Morvai: Guessing the Output of a 
Stationary Binary Time Series. 



In: Foundations of statistical inference (Shoresh, 2000), pp. 207-215, Contrib. Statist., Physica, Heidelberg, 2003.



Abstract 

The forward prediction problem for a binary time series $\{X_n\}_{n=0}^{\infty}$ is to estimate the probability that $X_{n+1} = 1$ based on the observations $X_i$, $0 \le i \le n$, without prior knowledge of the distribution of the process $\{X_n\}$. It is known that this is not possible if one estimates at all values of $n$. We present a simple procedure which will attempt to make such a prediction infinitely often at carefully selected stopping times chosen by the algorithm. The growth rate of the stopping times is also exhibited.

1 Introduction 

T. Cover in [3] asked two fundamental questions concerning estimation for stationary and 
ergodic binary processes. Cover's first question was as follows. 

Question 1 Is there an estimation scheme $f_{n+1}$ for the value $P(X_1 = 1 \mid X_0, X_{-1}, \ldots, X_{-n})$ such that $f_{n+1}$ depends solely on the observed data segment $X_0, X_{-1}, \ldots, X_{-n}$ and

$$\lim_{n\to\infty} \left| f_{n+1}(X_0, \ldots, X_{-n}) - P(X_1 = 1 \mid X_0, \ldots, X_{-n}) \right| = 0$$

almost surely for all stationary and ergodic binary time series $\{X_n\}$?

This question was answered by Ornstein [7] by constructing such a scheme. (See also Bailey [2].) Ornstein's scheme is not a simple one and the proof of consistency is rather sophisticated. A much simpler scheme and proof of consistency were provided by Morvai, Yakowitz, Györfi [6]. (See also Weiss [12].)
Here is Cover's second question.

Question 2 Is there an estimation scheme $f_{n+1}$ for the value $P(X_{n+1} = 1 \mid X_0, X_1, \ldots, X_n)$ such that $f_{n+1}$ depends solely on the data segment $X_0, X_1, \ldots, X_n$ and

$$\lim_{n\to\infty} \left| f_{n+1}(X_0, X_1, \ldots, X_n) - P(X_{n+1} = 1 \mid X_0, X_1, \ldots, X_n) \right| = 0$$

almost surely for all stationary and ergodic binary time series $\{X_n\}$?

Bailey [2] answered this question in the negative: he showed that there is no such scheme. (See also Ryabko [10], Györfi, Morvai, Yakowitz [4] and Weiss [12].) Bailey used the technique of cutting and stacking developed by Ornstein [8] (see also Shields [11]). Ryabko's construction was based on a function of an infinite-state Markov chain. This negative result can be interpreted as follows. Consider a weather forecaster whose task is to predict the probability of the event 'there will be rain tomorrow' given the observations up to the present day. Bailey's result says that the difference between the estimate and the true conditional probability cannot eventually be small for all stationary weather processes. The difference will be large infinitely often. These results show that there is a great difference between Questions 1 and 2. Question 1 was addressed by Morvai, Yakowitz, Algoet [5], who gave a very simple estimation scheme that satisfies the statement in Question 1 in probability instead of almost surely. Now consider a less ambitious goal than Question 2:

Question 3 Is there a sequence of stopping times $\{\lambda_n\}$ and an estimation scheme $f_n$ which depends on the observed data segment $(X_0, X_1, \ldots, X_{\lambda_n})$ such that

$$\lim_{n\to\infty} \left( f_n(X_0, X_1, \ldots, X_{\lambda_n}) - P(X_{\lambda_n+1} = 1 \mid X_0, X_1, \ldots, X_{\lambda_n}) \right) = 0$$

almost surely for all stationary binary time series $\{X_n\}$?

It turns out that the answer is affirmative, and such a scheme will be exhibited below. The result can be interpreted as follows: the weather forecaster may refrain from predicting on any given day, but will predict at infinitely many time instances, and at these stopping times the difference between the prediction and the true conditional probability vanishes almost surely.

2 Forward Estimation for Stationary Binary Time Series

Let $\{X_n\}_{n=-\infty}^{\infty}$ denote a two-sided stationary binary time series. For $n \ge m$, it will be convenient to use the notation $X_m^n = (X_m, \ldots, X_n)$. For $k = 1, 2, \ldots$, define the sequences $\{\tau_k\}$ and $\{\lambda_k\}$ recursively. Set $\lambda_0 = 0$. Let

$$\tau_k = \min\{t > 0 : X_t^{\lambda_{k-1}+t} = X_0^{\lambda_{k-1}}\}$$

and

$$\lambda_k = \tau_k + \lambda_{k-1}.$$

(By stationarity, the string $X_0^{\lambda_{k-1}}$ must appear in the sequence $X_1^{\infty}$ almost surely.) The $k$th estimate of $P(X_{\lambda_k+1} = 1 \mid X_0^{\lambda_k})$ is denoted by $P_k$, and is defined as

$$P_k = \frac{1}{k-1} \sum_{j=1}^{k-1} X_{\lambda_j+1}. \tag{1}$$
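The recursion for the stopping times and the average defining $P_k$ can be tried out on a finite sample. The following Python sketch is only an illustration under stated assumptions: the names `stopping_times` and `estimate` are hypothetical, the data is a finite list (so the search simply stops when the next recurrence does not fit in the sample), and $P_k$ is evaluated only for $k \ge 2$, where the average is nonempty.

```python
from typing import List, Tuple


def stopping_times(x: List[int], max_k: int) -> Tuple[List[int], List[int]]:
    """Return (taus, lams) with lams[0] = 0 and lams[k] = lambda_k.

    tau_k is the smallest t > 0 such that x[t : t + lam + 1] equals the
    prefix x[0 : lam + 1], where lam = lambda_{k-1}. The loop stops early
    if the finite sample is too short to contain the next recurrence.
    """
    taus: List[int] = []
    lams: List[int] = [0]
    for _ in range(max_k):
        lam = lams[-1]
        prefix = x[: lam + 1]
        # Only shifts t with t + lam <= len(x) - 1 give a full-length window.
        tau = next(
            (t for t in range(1, len(x) - lam) if x[t : t + lam + 1] == prefix),
            None,
        )
        if tau is None:
            break
        taus.append(tau)
        lams.append(lam + tau)
    return taus, lams


def estimate(x: List[int], lams: List[int], k: int) -> float:
    """P_k = (1 / (k - 1)) * sum_{j=1}^{k-1} x[lambda_j + 1], for k >= 2."""
    if not 2 <= k < len(lams):
        raise ValueError("need 2 <= k <= number of stopping times found")
    return sum(x[lams[j] + 1] for j in range(1, k)) / (k - 1)


if __name__ == "__main__":
    sample = [1, 1, 0] * 4  # a period-three toy sequence, not real data
    taus, lams = stopping_times(sample, max_k=10)
    print(taus, lams, estimate(sample, lams, 4))
```

On the toy sequence above the stopping times are 1, 4, 7, 10, and each of them is followed by the symbol 0, so the estimate is 0 for every admissible $k$.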
For an arbitrary stationary binary time series $\{Y_n\}_{n=-\infty}^{0}$, for $k = 1, 2, \ldots$, define the sequences $\hat\tau_k$ and $\hat\lambda_k$ recursively. Set $\hat\lambda_0 = 0$. Let

$$\hat\tau_k = \min\{t > 0 : Y_{-\hat\lambda_{k-1}-t}^{-t} = Y_{-\hat\lambda_{k-1}}^{0}\}$$

and let

$$\hat\lambda_k = \hat\tau_k + \hat\lambda_{k-1}.$$

When there is ambiguity as to which time series $\hat\tau_k$ and $\hat\lambda_k$ are to be applied, we will use the notation $\hat\tau_k(Y_{-\infty}^{0})$ and $\hat\lambda_k(Y_{-\infty}^{0})$.

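It will be useful to define another time series $\{\tilde X_n\}_{n=-\infty}^{0}$ as

$$\tilde X_{-\lambda_k}^{0} := X_0^{\lambda_k} \quad \text{for all } k \ge 1. \tag{2}$$

Since $X_{\lambda_{k+1}-\lambda_k}^{\lambda_{k+1}} = X_0^{\lambda_k}$, the above definition is consistent. Notice that it is immediate that $\hat\tau_k(\tilde X_{-\infty}^{0}) = \tau_k$ and $\hat\lambda_k(\tilde X_{-\infty}^{0}) = \lambda_k$.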
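The identity between the forward stopping times of the original series and the backward stopping times of the relabelled series can be checked on a small example. In the sketch below (hypothetical helper name `recurrence_times`, toy data chosen by hand), the backward recursion on a past segment is computed as the forward recursion on the reversed list, which assumes the backward recurrence compares the block ending $t$ steps in the past with the block ending at time 0.

```python
from typing import List


def recurrence_times(seq: List[int], max_k: int) -> List[int]:
    """Forward stopping times [lambda_0, lambda_1, ...] of a finite list."""
    lams = [0]
    for _ in range(max_k):
        lam = lams[-1]
        prefix = seq[: lam + 1]
        tau = next(
            (t for t in range(1, len(seq) - lam)
             if seq[t : t + lam + 1] == prefix),
            None,
        )
        if tau is None:
            break
        lams.append(lam + tau)
    return lams


# Toy sample; the middle block makes it differ from its own reversal.
x = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0]
forward = recurrence_times(x, max_k=10)

# The relabelled past: index -lambda_k .. 0 holds x[0 .. lambda_k].
tilde_past = x[: forward[-1] + 1]

# Backward recursion on the past = forward recursion on the reversed list,
# because position i of the reversed list holds the symbol at time -i.
backward = recurrence_times(tilde_past[::-1], max_k=10)

print(forward, backward)
```

Here both recursions return the same list of stopping times even though the sample is not a palindrome.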

Lemma 1 The two time series $\{\tilde X_n\}_{n=-\infty}^{0}$ and $\{X_n\}_{n=-\infty}^{0}$ have identical distribution, that is, for all $n \ge 0$ and $x_{-n}^{0} \in \{0,1\}^{n+1}$,

$$P(\tilde X_{-n}^{0} = x_{-n}^{0}) = P(X_{-n}^{0} = x_{-n}^{0}).$$

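Proof First we prove that

$$P(\tilde X_{-n}^{0} = x_{-n}^{0},\ \hat\lambda_k(\tilde X_{-\infty}^{0}) = n) = P(X_{-n}^{0} = x_{-n}^{0},\ \hat\lambda_k(X_{-\infty}^{0}) = n). \tag{3}$$

Indeed, by (2), $\tilde X_{-\lambda_k}^{0} = X_0^{\lambda_k}$ and it yields

$$P(\tilde X_{-n}^{0} = x_{-n}^{0},\ \hat\lambda_k(\tilde X_{-\infty}^{0}) = n) = P(X_0^{n} = x_{-n}^{0},\ \lambda_k = n),$$

and by stationarity,

$$P(X_0^{n} = x_{-n}^{0},\ \lambda_k = n) = P(X_{-n}^{0} = x_{-n}^{0},\ \hat\lambda_k(X_{-\infty}^{0}) = n)$$

and (3) is proved. Apply (3) in order to get

$$\begin{aligned}
P(\tilde X_{-n}^{0} = x_{-n}^{0})
&= \sum_{j=n}^{\infty} P(\tilde X_{-n}^{0} = x_{-n}^{0},\ \hat\lambda_n(\tilde X_{-\infty}^{0}) = j) \\
&= \sum_{j=n}^{\infty} \sum_{x_{-j}^{-n-1} \in \{0,1\}^{j-n}} P(\tilde X_{-j}^{0} = x_{-j}^{0},\ \hat\lambda_n(\tilde X_{-\infty}^{0}) = j) \\
&= \sum_{j=n}^{\infty} \sum_{x_{-j}^{-n-1} \in \{0,1\}^{j-n}} P(X_{-j}^{0} = x_{-j}^{0},\ \hat\lambda_n(X_{-\infty}^{0}) = j) \\
&= \sum_{j=n}^{\infty} P(X_{-n}^{0} = x_{-n}^{0},\ \hat\lambda_n(X_{-\infty}^{0}) = j) \\
&= P(X_{-n}^{0} = x_{-n}^{0})
\end{aligned}$$

and Lemma 1 is proved.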



Since $\{X_n\}_{n=-\infty}^{0}$ is a stationary time series, by Lemma 1 so is $\{\tilde X_n\}_{n=-\infty}^{0}$. Since a stationary time series can always be extended to be a two-sided time series, we have also defined $\{\tilde X_n\}_{n=-\infty}^{\infty}$. Now we prove the universal consistency of the estimator $P_k$.

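Theorem 1 For all stationary binary time series $\{X_n\}$ and estimator defined in (1),

$$\lim_{k\to\infty} \left( P_k - P(X_{\lambda_k+1} = 1 \mid X_0^{\lambda_k}) \right) = 0 \quad \text{almost surely.} \tag{4}$$

Moreover,

$$\lim_{k\to\infty} P_k = \lim_{k\to\infty} P(X_{\lambda_k+1} = 1 \mid X_0^{\lambda_k}) = P(\tilde X_1 = 1 \mid \tilde X_{-\infty}^{0}) \tag{5}$$

almost surely.

Proof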



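$$\begin{aligned}
P_k - P(X_{\lambda_k+1} = 1 \mid X_0^{\lambda_k})
&= \frac{1}{k-1} \sum_{j=1}^{k-1} \left[ X_{\lambda_j+1} - P(X_{\lambda_j+1} = 1 \mid X_0^{\lambda_j}) \right] \\
&\quad + \frac{1}{k-1} \sum_{j=1}^{k-1} P(X_{\lambda_j+1} = 1 \mid X_0^{\lambda_j}) - P(X_{\lambda_k+1} = 1 \mid X_0^{\lambda_k}) \\
&= \frac{1}{k-1} \sum_{j=1}^{k-1} \Gamma_j + \frac{1}{k-1} \sum_{j=1}^{k-1} \Delta_j - \Delta_k,
\end{aligned}$$

where $\Gamma_j = X_{\lambda_j+1} - P(X_{\lambda_j+1} = 1 \mid X_0^{\lambda_j})$ and $\Delta_j = P(X_{\lambda_j+1} = 1 \mid X_0^{\lambda_j})$.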



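Observe that $\{\Gamma_j, \sigma(X_0^{\lambda_j+1})\}$ is a bounded martingale difference sequence for $1 \le j < \infty$. To see this note that $\sigma(X_0^{\lambda_j+1})$ is monotone increasing, and $\Gamma_j$ is measurable with respect to $\sigma(X_0^{\lambda_j+1})$, and $E(\Gamma_j \mid X_0^{\lambda_{j-1}+1}) = 0$ for $1 \le j < \infty$. Now apply Azuma's exponential bound for bounded martingale differences in Azuma [1] to get that for any $\epsilon > 0$,

$$P\left( \left| \frac{1}{k-1} \sum_{j=1}^{k-1} \Gamma_j \right| > \epsilon \right) \le 2\exp\left(-\epsilon^2 (k-1)/2\right).$$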



After summing the right side over $k$, and appealing to the Borel-Cantelli lemma for a sequence of $\epsilon$'s tending to zero, we get

$$\frac{1}{k-1} \sum_{j=1}^{k-1} \Gamma_j \to 0 \quad \text{almost surely.}$$

It remains to show

$$\frac{1}{k-1} \sum_{j=1}^{k-1} \Delta_j - \Delta_k \to 0 \quad \text{almost surely.}$$
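The Borel-Cantelli step used for the martingale-difference average relies on the Azuma bound being summable in $k$. As a quick check (a routine computation added for illustration, not taken from the source), for fixed $\epsilon > 0$ the bound is a geometric series:

```latex
\sum_{k=2}^{\infty} 2\exp\!\left(-\frac{\epsilon^{2}(k-1)}{2}\right)
  = 2\sum_{m=1}^{\infty} \left(e^{-\epsilon^{2}/2}\right)^{m}
  = \frac{2\,e^{-\epsilon^{2}/2}}{1 - e^{-\epsilon^{2}/2}}
  < \infty .
```

Hence, for each fixed $\epsilon$, the event that the average exceeds $\epsilon$ in absolute value occurs only finitely often almost surely; taking $\epsilon = 1/m$ for $m = 1, 2, \ldots$ gives the almost sure convergence to zero.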
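Define

$$p_{k,n}(x_{-n}^{0}) = P(X_{\lambda_k+1} = 1 \mid X_0^{\lambda_k} = x_{-n}^{0},\ \lambda_k = n)$$

and (applying $\hat\lambda_k$ to the time series $\{\tilde X_n\}_{n=-\infty}^{0}$)

$$\tilde p_{k,n}(x_{-n}^{0}) = P(\tilde X_1 = 1 \mid \tilde X_{-n}^{0} = x_{-n}^{0},\ \hat\lambda_k = n).$$

Now the fact that $\lambda_k = \hat\lambda_k$ and Lemma 1 together imply

$$p_{k,n}(x_{-n}^{0}) = \tilde p_{k,n}(x_{-n}^{0}). \tag{6}$$

By (2) and (6),

$$p_{k,\lambda_k}(X_0^{\lambda_k}) = \tilde p_{k,\hat\lambda_k}(\tilde X_{-\hat\lambda_k}^{0}). \tag{7}$$

Combine (6) and (7) in order to get

$$P(X_{\lambda_k+1} = 1 \mid X_0^{\lambda_k}) = P(\tilde X_1 = 1 \mid \tilde X_{-\lambda_k}^{0}).$$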

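Notice that $\{P(\tilde X_1 = 1 \mid \tilde X_{-i}^{0}), \sigma(\tilde X_{-i}^{0})\}$ is a bounded martingale and so it converges almost surely to $P(\tilde X_1 = 1 \mid \tilde X_{-\infty}^{0})$, and so does $P(X_{\lambda_k+1} = 1 \mid X_0^{\lambda_k})$. We have proved that $\Delta_j$ converges almost surely. Now the Toeplitz lemma yields that $\frac{1}{k-1} \sum_{j=1}^{k-1} \Delta_j - \Delta_k \to 0$ almost surely. The proof of Theorem 1 is complete.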
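For readers who want the Toeplitz (Cesàro averaging) step spelled out, here is a short sketch. The symbols $\Delta$, $\delta$ and $J$ are introduced only for this illustration: $\Delta$ is the almost sure limit of $\Delta_j$, and $J$ is chosen so that $|\Delta_j - \Delta| \le \delta$ for all $j > J$. Since every $\Delta_j$ and $\Delta$ lie in $[0,1]$, for $k - 1 > J$:

```latex
\left| \frac{1}{k-1}\sum_{j=1}^{k-1} \Delta_j - \Delta \right|
  \le \frac{1}{k-1}\sum_{j=1}^{J} |\Delta_j - \Delta|
    + \frac{1}{k-1}\sum_{j=J+1}^{k-1} |\Delta_j - \Delta|
  \le \frac{J}{k-1} + \delta .
```

Letting $k \to \infty$ and then $\delta \to 0$ shows that the averages converge to $\Delta$; since $\Delta_k$ also converges to $\Delta$, the difference of the two tends to zero.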



3 The Growth Rate of the Stopping Times 

The next result shows that the growth of the stopping times $\{\lambda_k\}$ is rather rapid. Let

$$p(x_{-n}^{0}) = P(X_{-n}^{0} = x_{-n}^{0}).$$

Theorem 2 Let $\{X_n\}$ be a stationary and ergodic binary time series. Suppose that $H > 0$ where

$$H = \lim_{n\to\infty} -\frac{1}{n+1} E \log p(X_0, \ldots, X_n)$$

is the process entropy. Let $0 < \epsilon < H$ be arbitrary. Then for $k$ large enough,

$$\lambda_k(\omega) \ge c^{c^{\cdot^{\cdot^{c}}}} \quad \text{almost surely,} \tag{8}$$

where the height of the tower is $k - K$, $K(\omega)$ is a finite number which depends on $\omega$, and $c = 2^{H-\epsilon}$.
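To get a feeling for how fast such a power tower grows, the following Python sketch computes the base-2 logarithm of a tower of a given height. It is an illustration only: the function name `log2_tower` is hypothetical, and the base 2 used in the example is a convenient boundary value (for a binary process with base-2 logarithms the theorem's base is strictly below 2).

```python
import math


def log2_tower(c: float, height: int) -> float:
    """Return log2 of the power tower c^(c^(...^c)) built from `height` copies of c.

    Working with the logarithm delays floating-point overflow by one level;
    math.inf is returned once even the logarithm is too large to represent.
    """
    if height < 1:
        raise ValueError("height must be at least 1")
    log_val = math.log2(c)  # log2 of a tower of height 1
    for _ in range(height - 1):
        try:
            tower = 2.0 ** log_val  # value of the current tower
        except OverflowError:
            return math.inf
        # next tower = c ** tower, so its log2 is tower * log2(c)
        log_val = tower * math.log2(c)
    return log_val


if __name__ == "__main__":
    for h in range(1, 6):
        print(h, log2_tower(2.0, h))
```

With base 2, a tower of height 4 is already 65536 and a tower of height 5 has 65536 binary digits, which is why only a handful of stopping times can be observed in any realistic sample.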




Proof Since by (2), $\lambda_k = \hat\lambda_k(\tilde X_{-\infty}^{0})$, and by Lemma 1 the time series $\{\tilde X_n\}_{n=-\infty}^{\infty}$ and $\{X_n\}_{n=-\infty}^{\infty}$ have identical distributions, and hence the same entropy, it is enough to prove the result for $\hat\lambda_k(X_{-\infty}^{0})$. Now $\hat\tau_k$ and $\hat\lambda_k$ are evaluated on the process $\{X_n\}_{n=-\infty}^{\infty}$. For $0 \le l < \infty$ define

$$R(l) = \min\{j \ge l+1 : X_{-l-j}^{-j} = X_{-l}^{0}\}.$$

By Ornstein and Weiss [9],

$$\frac{1}{l+1} \log R(l) \to H \quad \text{almost surely.} \tag{9}$$
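The recurrence time $R(l)$ can be computed on a finite stretch of the past. The sketch below is a minimal illustration under stated assumptions: the function name `recurrence_time` is hypothetical, the list `past` stores the past with its last entry playing the role of $X_0$, and `None` is returned when the sample is too short to contain a recurrence.

```python
import math
from typing import List, Optional


def recurrence_time(past: List[int], l: int) -> Optional[int]:
    """Smallest j >= l + 1 such that the block of length l + 1 ending j steps
    before the end of `past` equals the block of length l + 1 at the end."""
    n = len(past)
    block = past[n - 1 - l : n]  # (X_{-l}, ..., X_0)
    j = l + 1
    while n - 1 - l - j >= 0:
        # (X_{-l-j}, ..., X_{-j}) occupies positions n-1-l-j .. n-1-j
        if past[n - 1 - l - j : n - j] == block:
            return j
        j += 1
    return None


if __name__ == "__main__":
    past = [1, 1, 0] * 4  # periodic toy data, so the entropy is zero
    for l in (0, 2):
        r = recurrence_time(past, l)
        print(l, r, math.log2(r) / (l + 1))
```

For periodic data the recurrence time stays bounded, so the normalised logarithm tends to zero, in line with the zero entropy of such a sequence.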



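First we show that if $H > 0$ then for $k$ large enough $\hat\tau_{k+1} > \hat\lambda_k$ almost surely. We argue by contradiction. Suppose that $\hat\tau_{k+1} \to \infty$ and $\hat\tau_{k+1} \le \hat\lambda_k$ infinitely often. Then

$$X_{-\hat\lambda_k-\hat\tau_{k+1}}^{-\hat\tau_{k+1}} = X_{-\hat\lambda_k}^{0}$$

and $\hat\tau_{k+1} \le \hat\lambda_k$ infinitely often. Hence

$$X_{-2\hat\tau_{k+1}+1}^{-\hat\tau_{k+1}} = X_{-\hat\tau_{k+1}+1}^{0}$$

infinitely often and $R(\hat\tau_{k+1} - 1) \le \hat\tau_{k+1}$ infinitely often. Then by (9),

$$H = \lim_{k\to\infty} \frac{1}{\hat\tau_{k+1}} \log R(\hat\tau_{k+1} - 1) \le \lim_{k\to\infty} \frac{1}{\hat\tau_{k+1}} \log \hat\tau_{k+1} = 0$$

provided that $\hat\tau_k \to \infty$. Now assume that $\eta = \sup_{0<k<\infty} \hat\tau_k$ is finite. Then $R(n\eta - 1) = n\eta$. Now by (9),

$$H = \lim_{n\to\infty} \frac{1}{n\eta} \log R(n\eta - 1) \le \lim_{n\to\infty} \frac{1}{n\eta} \log(n\eta) = 0.$$

We have shown that $H > 0$ implies that for $k$ large enough $\hat\tau_{k+1} > \hat\lambda_k$ almost surely and hence for $k$ large enough $R(\hat\lambda_k) = \hat\tau_{k+1}$ almost surely. Hence by (9),

$$\frac{1}{\hat\lambda_k+1} \log \hat\tau_{k+1} \to H \quad \text{almost surely.}$$

Thus for almost every $\omega \in \Omega$ there exists a positive finite integer $K(\omega)$ such that for $k \ge K(\omega)$, $\frac{1}{\hat\lambda_k+1} \log \hat\tau_{k+1} > H - \epsilon$ and

$$\hat\lambda_{k+1} > \hat\tau_{k+1} > c^{\hat\lambda_k} \quad \text{for } k \ge K(\omega),$$

and the proof of Theorem 2 is complete.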
4 Guessing the Output at Stopping Time Instances

If the weather forecaster is pressed to say simply whether or not it will rain tomorrow, then we need a guessing scheme rather than a predictor. Define the guessing scheme $\{\bar X_{\lambda_k}\}$ for the values $\{X_{\lambda_k+1}\}$ as

$$\bar X_{\lambda_k} = 1_{\{P_k \ge 0.5\}}.$$

Let $X^*_{\lambda_k}$ denote the Bayes rule, that is,

$$X^*_{\lambda_k} = 1_{\{P(X_{\lambda_k+1} = 1 \mid X_0^{\lambda_k}) \ge 0.5\}}.$$
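The guessing scheme amounts to thresholding the running average of the symbols that followed earlier stopping times. A minimal Python sketch follows; the function name `guess_at_stopping_times` is hypothetical, the stopping times are assumed to be precomputed and passed in as a list starting with 0, guesses are produced only for $k \ge 2$, and ties at exactly 0.5 are resolved in favour of guessing 1, which is an assumption about the threshold convention.

```python
from typing import List, Tuple


def guess_at_stopping_times(
    x: List[int], lams: List[int]
) -> List[Tuple[int, int, int]]:
    """Return (k, guess, actual) for each k >= 2 whose next symbol is observed.

    lams[k] is the k-th stopping time with lams[0] = 0. The guess at time
    lams[k] is 1 if the average of x[lams[j] + 1] over j = 1..k-1 is at
    least 0.5, and 0 otherwise; `actual` is x[lams[k] + 1].
    """
    results: List[Tuple[int, int, int]] = []
    running_sum = 0
    for k in range(2, len(lams)):
        running_sum += x[lams[k - 1] + 1]  # symbol after the (k-1)-th stopping time
        p_k = running_sum / (k - 1)
        if lams[k] + 1 >= len(x):
            break  # the symbol to be guessed lies outside the sample
        results.append((k, 1 if p_k >= 0.5 else 0, x[lams[k] + 1]))
    return results


if __name__ == "__main__":
    sample = [1, 1, 0] * 4            # toy periodic data
    lams = [0, 1, 4, 7, 10]           # its stopping times, computed by hand
    print(guess_at_stopping_times(sample, lams))
```

On this toy sequence every guess equals the symbol that actually follows the stopping time.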



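Theorem 3 Let $\{X_n\}_{n=-\infty}^{\infty}$ be a stationary binary time series. The proposed guessing scheme $\bar X_{\lambda_k}$ works in the average at stopping times $\lambda_k$ just as well as the Bayes rule, that is,

$$\lim_{n\to\infty} \left( \frac{1}{n} \sum_{k=1}^{n} 1_{\{\bar X_{\lambda_k} = X_{\lambda_k+1}\}} - \frac{1}{n} \sum_{k=1}^{n} 1_{\{X^*_{\lambda_k} = X_{\lambda_k+1}\}} \right) = 0 \tag{10}$$

almost surely. Moreover,

$$\lim_{k\to\infty} \left( P(\bar X_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) - P(X^*_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) \right) = 0 \tag{11}$$

almost surely.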
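Proof

$$\begin{aligned}
&\frac{1}{n} \sum_{k=1}^{n} 1_{\{\bar X_{\lambda_k} = X_{\lambda_k+1}\}} - \frac{1}{n} \sum_{k=1}^{n} 1_{\{X^*_{\lambda_k} = X_{\lambda_k+1}\}} \\
&\quad = \frac{1}{n} \sum_{k=1}^{n} \left[ 1_{\{\bar X_{\lambda_k} = X_{\lambda_k+1}\}} - P(\bar X_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) \right] \\
&\qquad - \frac{1}{n} \sum_{k=1}^{n} \left[ 1_{\{X^*_{\lambda_k} = X_{\lambda_k+1}\}} - P(X^*_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) \right] \\
&\qquad + \frac{1}{n} \sum_{k=1}^{n} \left[ P(\bar X_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) - P(X^*_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) \right] \\
&\quad = \Gamma_n + \Theta_n + \Psi_n,
\end{aligned}$$

where $\Gamma_n$, $\Theta_n$ and $\Psi_n$ denote the three terms in order, each taken with its sign.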

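Now $\Gamma_n$ and $\Theta_n$ tend to zero since they are averages of bounded martingale differences (cf. Azuma [1]). Concerning the third term $\Psi_n$, it is enough to prove that

$$\lim_{k\to\infty} \left( P(\bar X_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) - P(X^*_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) \right) = 0$$

almost surely. To see this recall the result in Theorem 1,

$$\lim_{k\to\infty} P_k = \lim_{k\to\infty} P(X_{\lambda_k+1} = 1 \mid X_0^{\lambda_k}) = P(\tilde X_1 = 1 \mid \tilde X_{-\infty}^{0})$$

almost surely, and apply this in order to get

$$\begin{aligned}
&\lim_{k\to\infty} \left[ P(\bar X_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) - P(X^*_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) \right] \\
&\quad = \lim_{k\to\infty} \Big\{ \big[ P(P(\tilde X_1 = 1 \mid \tilde X_{-\infty}^{0}) \ne 0.5,\ \bar X_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) \\
&\qquad\qquad - P(P(\tilde X_1 = 1 \mid \tilde X_{-\infty}^{0}) \ne 0.5,\ X^*_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) \big] \\
&\qquad + \big[ P(P(\tilde X_1 = 1 \mid \tilde X_{-\infty}^{0}) = 0.5,\ \bar X_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) \\
&\qquad\qquad - P(P(\tilde X_1 = 1 \mid \tilde X_{-\infty}^{0}) = 0.5,\ X^*_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) \big] \Big\} \\
&\quad = 0.
\end{aligned}$$

The proof of Theorem 3 is now complete.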